{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 项目：评估和清理英国电商公司销售数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 分析目标"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此数据分析的目的是，根据市场销售数据，挖掘畅销产品，以便制定更有效的市场策略来提升营收。\n",
    "\n",
    "本实战项目的目的在于练习评估数据干净和整洁度，并且基于评估结果，对数据进行清洗，从而得到可供下一步分析的数据。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 简介"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "原始数据集记录了一家英国在线零售公司在2010年12月1日至2011年12月9日期间的所有交易情况，涵盖了该公司在全球不同国家和地区的业务数据。该公司主要销售覆盖各个场景的礼品，包括但不限于生日礼品、结婚纪念品、圣诞礼品等等。该公司的客户群体主要包括批发商和个人消费者，其中批发商占据了相当大的比例。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据每列的含义如下：\n",
    "- `InvoiceNo`: 发票号码。6位数，作为交易的唯一标识符。如果这个代码以字母“c”开头，表示这笔交易被取消。\n",
    "- `StockCode`: 产品代码。5位数，作为产品的唯一标识符。\n",
    "- `Description`: 产品名称。\n",
    "- `Quantity`: 产品在交易中的数量。\n",
    "- `InvoiceDate`: 发票日期和时间。交易发生的日期和时间。\n",
    "- `UnitPrice`: 单价。价格单位为英镑（£）。\n",
    "- `CustomerID`: 客户编号。5位数，作为客户的唯一标识符。\n",
    "- `Country`: 国家名称。客户所居住的国家的名称。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 读取数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入数据分析所需要的库，并通过pandas的read_csv函数，将原始数据文件\"e_commerce.csv\"里的数据内容,解析为dataframe，并赋值给变量original_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data = pd.read_csv(\"e_commerce.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>536365</td>\n",
       "      <td>85123A</td>\n",
       "      <td>WHITE HANGING HEART T-LIGHT HOLDER</td>\n",
       "      <td>6</td>\n",
       "      <td>12/1/2010 8:26</td>\n",
       "      <td>2.55</td>\n",
       "      <td>17850.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>536365</td>\n",
       "      <td>71053</td>\n",
       "      <td>WHITE METAL LANTERN</td>\n",
       "      <td>6</td>\n",
       "      <td>12/1/2010 8:26</td>\n",
       "      <td>3.39</td>\n",
       "      <td>17850.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>536365</td>\n",
       "      <td>84406B</td>\n",
       "      <td>CREAM CUPID HEARTS COAT HANGER</td>\n",
       "      <td>8</td>\n",
       "      <td>12/1/2010 8:26</td>\n",
       "      <td>2.75</td>\n",
       "      <td>17850.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>536365</td>\n",
       "      <td>84029G</td>\n",
       "      <td>KNITTED UNION FLAG HOT WATER BOTTLE</td>\n",
       "      <td>6</td>\n",
       "      <td>12/1/2010 8:26</td>\n",
       "      <td>3.39</td>\n",
       "      <td>17850.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>536365</td>\n",
       "      <td>84029E</td>\n",
       "      <td>RED WOOLLY HOTTIE WHITE HEART.</td>\n",
       "      <td>6</td>\n",
       "      <td>12/1/2010 8:26</td>\n",
       "      <td>3.39</td>\n",
       "      <td>17850.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  InvoiceNo StockCode                          Description  Quantity  \\\n",
       "0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   \n",
       "1    536365     71053                  WHITE METAL LANTERN         6   \n",
       "2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   \n",
       "3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   \n",
       "4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   \n",
       "\n",
       "      InvoiceDate  UnitPrice  CustomerID         Country  \n",
       "0  12/1/2010 8:26       2.55     17850.0  United Kingdom  \n",
       "1  12/1/2010 8:26       3.39     17850.0  United Kingdom  \n",
       "2  12/1/2010 8:26       2.75     17850.0  United Kingdom  \n",
       "3  12/1/2010 8:26       3.39     17850.0  United Kingdom  \n",
       "4  12/1/2010 8:26       3.39     17850.0  United Kingdom  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 评估数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这一部分，将对上一部分建立的original_data这个DataFrame所包含的数据进行评估\n",
    "\n",
    "评估主要从两个方面进行：结构和内容，即整齐度和干净度。数据的结构性问题指不符合“每列是一个变量，每行是一个观察值，每个单元格是一个值”这三个标准，数据的内容性问题包括讯在丢失数据，重复数据，无效数据等。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 评估数据整齐度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>343200</th>\n",
       "      <td>566949</td>\n",
       "      <td>23353</td>\n",
       "      <td>6 GIFT TAGS VINTAGE CHRISTMAS</td>\n",
       "      <td>2</td>\n",
       "      <td>9/15/2011 16:32</td>\n",
       "      <td>1.63</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>149893</th>\n",
       "      <td>549312</td>\n",
       "      <td>22763</td>\n",
       "      <td>KEY CABINET MA CAMPAGNE</td>\n",
       "      <td>2</td>\n",
       "      <td>4/8/2011 9:26</td>\n",
       "      <td>9.95</td>\n",
       "      <td>14217.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>514926</th>\n",
       "      <td>579751</td>\n",
       "      <td>22385</td>\n",
       "      <td>JUMBO BAG SPACEBOY DESIGN</td>\n",
       "      <td>1</td>\n",
       "      <td>11/30/2011 14:46</td>\n",
       "      <td>4.13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>439647</th>\n",
       "      <td>574478</td>\n",
       "      <td>23077</td>\n",
       "      <td>DOUGHNUT LIP GLOSS</td>\n",
       "      <td>20</td>\n",
       "      <td>11/4/2011 12:27</td>\n",
       "      <td>1.25</td>\n",
       "      <td>17203.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>499971</th>\n",
       "      <td>578703</td>\n",
       "      <td>85034C</td>\n",
       "      <td>3 ROSE MORRIS BOXED CANDLES</td>\n",
       "      <td>3</td>\n",
       "      <td>11/25/2011 11:02</td>\n",
       "      <td>1.25</td>\n",
       "      <td>17385.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>177040</th>\n",
       "      <td>552039</td>\n",
       "      <td>21164</td>\n",
       "      <td>HOME SWEET HOME METAL SIGN</td>\n",
       "      <td>12</td>\n",
       "      <td>5/6/2011 8:10</td>\n",
       "      <td>2.95</td>\n",
       "      <td>13081.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>335049</th>\n",
       "      <td>566275</td>\n",
       "      <td>20724</td>\n",
       "      <td>RED RETROSPOT CHARLOTTE BAG</td>\n",
       "      <td>4</td>\n",
       "      <td>9/11/2011 13:58</td>\n",
       "      <td>0.85</td>\n",
       "      <td>16549.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>204754</th>\n",
       "      <td>554694</td>\n",
       "      <td>22139</td>\n",
       "      <td>RETROSPOT TEA SET CERAMIC 11 PC</td>\n",
       "      <td>3</td>\n",
       "      <td>5/25/2011 17:06</td>\n",
       "      <td>10.79</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>369257</th>\n",
       "      <td>569022</td>\n",
       "      <td>23027</td>\n",
       "      <td>DRAWER KNOB VINTAGE GLASS HEXAGON</td>\n",
       "      <td>12</td>\n",
       "      <td>9/30/2011 9:45</td>\n",
       "      <td>2.08</td>\n",
       "      <td>17725.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>248345</th>\n",
       "      <td>558811</td>\n",
       "      <td>22488</td>\n",
       "      <td>NATURAL SLATE RECTANGLE CHALKBOARD</td>\n",
       "      <td>1</td>\n",
       "      <td>7/4/2011 11:11</td>\n",
       "      <td>3.29</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       InvoiceNo StockCode                         Description  Quantity  \\\n",
       "343200    566949     23353      6 GIFT TAGS VINTAGE CHRISTMAS          2   \n",
       "149893    549312     22763             KEY CABINET MA CAMPAGNE         2   \n",
       "514926    579751     22385           JUMBO BAG SPACEBOY DESIGN         1   \n",
       "439647    574478     23077                 DOUGHNUT LIP GLOSS         20   \n",
       "499971    578703    85034C         3 ROSE MORRIS BOXED CANDLES         3   \n",
       "177040    552039     21164         HOME SWEET HOME METAL SIGN         12   \n",
       "335049    566275     20724         RED RETROSPOT CHARLOTTE BAG         4   \n",
       "204754    554694     22139    RETROSPOT TEA SET CERAMIC 11 PC          3   \n",
       "369257    569022     23027   DRAWER KNOB VINTAGE GLASS HEXAGON        12   \n",
       "248345    558811     22488  NATURAL SLATE RECTANGLE CHALKBOARD         1   \n",
       "\n",
       "             InvoiceDate  UnitPrice  CustomerID         Country  \n",
       "343200   9/15/2011 16:32       1.63         NaN  United Kingdom  \n",
       "149893     4/8/2011 9:26       9.95     14217.0  United Kingdom  \n",
       "514926  11/30/2011 14:46       4.13         NaN  United Kingdom  \n",
       "439647   11/4/2011 12:27       1.25     17203.0  United Kingdom  \n",
       "499971  11/25/2011 11:02       1.25     17385.0  United Kingdom  \n",
       "177040     5/6/2011 8:10       2.95     13081.0  United Kingdom  \n",
       "335049   9/11/2011 13:58       0.85     16549.0  United Kingdom  \n",
       "204754   5/25/2011 17:06      10.79         NaN  United Kingdom  \n",
       "369257    9/30/2011 9:45       2.08     17725.0  United Kingdom  \n",
       "248345    7/4/2011 11:11       3.29         NaN  United Kingdom  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从抽样的10行数据来看，数据符合“每列是一个变量，每行是一个观察值，每个单元格是一个值”，具体来看每行是关于某商品的一次交易，每列是交易相关的各个变量，因此不存在结构性问题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 评估数据干净度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 541909 entries, 0 to 541908\n",
      "Data columns (total 8 columns):\n",
      " #   Column       Non-Null Count   Dtype  \n",
      "---  ------       --------------   -----  \n",
      " 0   InvoiceNo    541909 non-null  object \n",
      " 1   StockCode    541909 non-null  object \n",
      " 2   Description  540455 non-null  object \n",
      " 3   Quantity     541909 non-null  int64  \n",
      " 4   InvoiceDate  541909 non-null  object \n",
      " 5   UnitPrice    541909 non-null  float64\n",
      " 6   CustomerID   406829 non-null  float64\n",
      " 7   Country      541909 non-null  object \n",
      "dtypes: float64(2), int64(1), object(5)\n",
      "memory usage: 33.1+ MB\n"
     ]
    }
   ],
   "source": [
    "original_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从输出结果来看，数据共有541909条观察值，而Description、CustomerID变量存在缺失值\n",
    "\n",
    "此外，InvoiceDate的数据类型应为日期，CustomerID的数据类型应为字符串，应当进行数据格式转换。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 评估缺失数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在了解Description存在缺失值后，根据条件提取出缺失观察值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>622</th>\n",
       "      <td>536414</td>\n",
       "      <td>22139</td>\n",
       "      <td>NaN</td>\n",
       "      <td>56</td>\n",
       "      <td>12/1/2010 11:52</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1970</th>\n",
       "      <td>536545</td>\n",
       "      <td>21134</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>12/1/2010 14:32</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1971</th>\n",
       "      <td>536546</td>\n",
       "      <td>22145</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>12/1/2010 14:33</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1972</th>\n",
       "      <td>536547</td>\n",
       "      <td>37509</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>12/1/2010 14:33</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1987</th>\n",
       "      <td>536549</td>\n",
       "      <td>85226A</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>12/1/2010 14:34</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>535322</th>\n",
       "      <td>581199</td>\n",
       "      <td>84581</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-2</td>\n",
       "      <td>12/7/2011 18:26</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>535326</th>\n",
       "      <td>581203</td>\n",
       "      <td>23406</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15</td>\n",
       "      <td>12/7/2011 18:31</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>535332</th>\n",
       "      <td>581209</td>\n",
       "      <td>21620</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6</td>\n",
       "      <td>12/7/2011 18:35</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>536981</th>\n",
       "      <td>581234</td>\n",
       "      <td>72817</td>\n",
       "      <td>NaN</td>\n",
       "      <td>27</td>\n",
       "      <td>12/8/2011 10:33</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>538554</th>\n",
       "      <td>581408</td>\n",
       "      <td>85175</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20</td>\n",
       "      <td>12/8/2011 14:06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1454 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       InvoiceNo StockCode Description  Quantity      InvoiceDate  UnitPrice  \\\n",
       "622       536414     22139         NaN        56  12/1/2010 11:52        0.0   \n",
       "1970      536545     21134         NaN         1  12/1/2010 14:32        0.0   \n",
       "1971      536546     22145         NaN         1  12/1/2010 14:33        0.0   \n",
       "1972      536547     37509         NaN         1  12/1/2010 14:33        0.0   \n",
       "1987      536549    85226A         NaN         1  12/1/2010 14:34        0.0   \n",
       "...          ...       ...         ...       ...              ...        ...   \n",
       "535322    581199     84581         NaN        -2  12/7/2011 18:26        0.0   \n",
       "535326    581203     23406         NaN        15  12/7/2011 18:31        0.0   \n",
       "535332    581209     21620         NaN         6  12/7/2011 18:35        0.0   \n",
       "536981    581234     72817         NaN        27  12/8/2011 10:33        0.0   \n",
       "538554    581408     85175         NaN        20  12/8/2011 14:06        0.0   \n",
       "\n",
       "        CustomerID         Country  \n",
       "622            NaN  United Kingdom  \n",
       "1970           NaN  United Kingdom  \n",
       "1971           NaN  United Kingdom  \n",
       "1972           NaN  United Kingdom  \n",
       "1987           NaN  United Kingdom  \n",
       "...            ...             ...  \n",
       "535322         NaN  United Kingdom  \n",
       "535326         NaN  United Kingdom  \n",
       "535332         NaN  United Kingdom  \n",
       "536981         NaN  United Kingdom  \n",
       "538554         NaN  United Kingdom  \n",
       "\n",
       "[1454 rows x 8 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data[original_data[\"Description\"].isnull()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有1454条交易数据缺失Description变量值\n",
    "\n",
    "从输出结果来看，这些缺失Description的交易数据，UnitPrice都为0。为了验证猜想，增加筛选条件，看看是否存在Description变量缺失且UnitPrice不为0的数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country]\n",
       "Index: []"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data[(original_data[\"Description\"].isnull()) & (original_data[\"UnitPrice\"] != 0)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "筛选出来结果数量为0条，说明缺失Descripition值的数据，同时也不具备有效的UnitPrice值\n",
    "\n",
    "Descripition表示产品名称，UnitPrice表示产品单价，都是进行后续商品交易分析的重要变量。如果它们同时缺失/无效，我们认为数据无法提供有效含义，因此这些后续可以被删除。\n",
    "\n",
    "CustomerID变量同样存在缺失值，因此也根据条件提取出缺失观察值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>622</th>\n",
       "      <td>536414</td>\n",
       "      <td>22139</td>\n",
       "      <td>NaN</td>\n",
       "      <td>56</td>\n",
       "      <td>12/1/2010 11:52</td>\n",
       "      <td>0.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1443</th>\n",
       "      <td>536544</td>\n",
       "      <td>21773</td>\n",
       "      <td>DECORATIVE ROSE BATHROOM BOTTLE</td>\n",
       "      <td>1</td>\n",
       "      <td>12/1/2010 14:32</td>\n",
       "      <td>2.51</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1444</th>\n",
       "      <td>536544</td>\n",
       "      <td>21774</td>\n",
       "      <td>DECORATIVE CATS BATHROOM BOTTLE</td>\n",
       "      <td>2</td>\n",
       "      <td>12/1/2010 14:32</td>\n",
       "      <td>2.51</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1445</th>\n",
       "      <td>536544</td>\n",
       "      <td>21786</td>\n",
       "      <td>POLKADOT RAIN HAT</td>\n",
       "      <td>4</td>\n",
       "      <td>12/1/2010 14:32</td>\n",
       "      <td>0.85</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1446</th>\n",
       "      <td>536544</td>\n",
       "      <td>21787</td>\n",
       "      <td>RAIN PONCHO RETROSPOT</td>\n",
       "      <td>2</td>\n",
       "      <td>12/1/2010 14:32</td>\n",
       "      <td>1.66</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541536</th>\n",
       "      <td>581498</td>\n",
       "      <td>85099B</td>\n",
       "      <td>JUMBO BAG RED RETROSPOT</td>\n",
       "      <td>5</td>\n",
       "      <td>12/9/2011 10:26</td>\n",
       "      <td>4.13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541537</th>\n",
       "      <td>581498</td>\n",
       "      <td>85099C</td>\n",
       "      <td>JUMBO  BAG BAROQUE BLACK WHITE</td>\n",
       "      <td>4</td>\n",
       "      <td>12/9/2011 10:26</td>\n",
       "      <td>4.13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541538</th>\n",
       "      <td>581498</td>\n",
       "      <td>85150</td>\n",
       "      <td>LADIES &amp; GENTLEMEN METAL SIGN</td>\n",
       "      <td>1</td>\n",
       "      <td>12/9/2011 10:26</td>\n",
       "      <td>4.96</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541539</th>\n",
       "      <td>581498</td>\n",
       "      <td>85174</td>\n",
       "      <td>S/4 CACTI CANDLES</td>\n",
       "      <td>1</td>\n",
       "      <td>12/9/2011 10:26</td>\n",
       "      <td>10.79</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541540</th>\n",
       "      <td>581498</td>\n",
       "      <td>DOT</td>\n",
       "      <td>DOTCOM POSTAGE</td>\n",
       "      <td>1</td>\n",
       "      <td>12/9/2011 10:26</td>\n",
       "      <td>1714.17</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>135080 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       InvoiceNo StockCode                      Description  Quantity  \\\n",
       "622       536414     22139                              NaN        56   \n",
       "1443      536544     21773  DECORATIVE ROSE BATHROOM BOTTLE         1   \n",
       "1444      536544     21774  DECORATIVE CATS BATHROOM BOTTLE         2   \n",
       "1445      536544     21786               POLKADOT RAIN HAT          4   \n",
       "1446      536544     21787            RAIN PONCHO RETROSPOT         2   \n",
       "...          ...       ...                              ...       ...   \n",
       "541536    581498    85099B          JUMBO BAG RED RETROSPOT         5   \n",
       "541537    581498    85099C   JUMBO  BAG BAROQUE BLACK WHITE         4   \n",
       "541538    581498     85150    LADIES & GENTLEMEN METAL SIGN         1   \n",
       "541539    581498     85174                S/4 CACTI CANDLES         1   \n",
       "541540    581498       DOT                   DOTCOM POSTAGE         1   \n",
       "\n",
       "            InvoiceDate  UnitPrice  CustomerID         Country  \n",
       "622     12/1/2010 11:52       0.00         NaN  United Kingdom  \n",
       "1443    12/1/2010 14:32       2.51         NaN  United Kingdom  \n",
       "1444    12/1/2010 14:32       2.51         NaN  United Kingdom  \n",
       "1445    12/1/2010 14:32       0.85         NaN  United Kingdom  \n",
       "1446    12/1/2010 14:32       1.66         NaN  United Kingdom  \n",
       "...                 ...        ...         ...             ...  \n",
       "541536  12/9/2011 10:26       4.13         NaN  United Kingdom  \n",
       "541537  12/9/2011 10:26       4.13         NaN  United Kingdom  \n",
       "541538  12/9/2011 10:26       4.96         NaN  United Kingdom  \n",
       "541539  12/9/2011 10:26      10.79         NaN  United Kingdom  \n",
       "541540  12/9/2011 10:26    1714.17         NaN  United Kingdom  \n",
       "\n",
       "[135080 rows x 8 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data[original_data[\"CustomerID\"].isnull()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CustomerID表示客户编号，不是分析畅销商品的必要变量。并且从输出结果来看，有些CustomerID缺失的交易数据仍然有效，因为保留此变量为空的观察值"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 评估重复数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据数据变量含义来看，虽然InvoiceNo、StockCode和CustomerID都是唯一标识符，但是一次交易可能包含多见商品，因此InvoiceNo可以存在重复，顾客可以进行多次交易或下单多个商品，因此CustomerID也可以存在重复。\n",
    "\n",
    "那么针对此数据集，我们无需评估重复数据。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 评估不一致数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "不一致数据可以存在于Country变量中，我们要查看是否存在多个不同指代同一国家的情况"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Country\n",
       "United Kingdom          495266\n",
       "Germany                   9495\n",
       "France                    8557\n",
       "EIRE                      8196\n",
       "Spain                     2533\n",
       "Netherlands               2371\n",
       "Belgium                   2069\n",
       "Switzerland               2002\n",
       "Portugal                  1519\n",
       "Australia                 1259\n",
       "Norway                    1086\n",
       "Italy                      803\n",
       "Channel Islands            758\n",
       "Finland                    695\n",
       "Cyprus                     622\n",
       "Sweden                     462\n",
       "Unspecified                446\n",
       "Austria                    401\n",
       "Denmark                    389\n",
       "Japan                      358\n",
       "Poland                     341\n",
       "Israel                     297\n",
       "China                      288\n",
       "Singapore                  229\n",
       "USA                        218\n",
       "UK                         211\n",
       "Iceland                    182\n",
       "Canada                     151\n",
       "Greece                     146\n",
       "Malta                      127\n",
       "United States               73\n",
       "United Arab Emirates        68\n",
       "European Community          61\n",
       "RSA                         58\n",
       "Lebanon                     45\n",
       "Lithuania                   35\n",
       "Brazil                      32\n",
       "Czech Republic              30\n",
       "Bahrain                     19\n",
       "Saudi Arabia                10\n",
       "U.K.                         1\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data[\"Country\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从country变量值来看，“USA”、“United States”均表示美国，“United Kingdom”、“UK”、“U.K.” 均表示英国，因此应该在清洗步骤对这些值进行统一，只保留一个指代值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 评估无效或错误数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Quantity</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>541909.000000</td>\n",
       "      <td>541909.000000</td>\n",
       "      <td>406829.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>9.552250</td>\n",
       "      <td>4.611114</td>\n",
       "      <td>15287.690570</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>218.081158</td>\n",
       "      <td>96.759853</td>\n",
       "      <td>1713.600303</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-80995.000000</td>\n",
       "      <td>-11062.060000</td>\n",
       "      <td>12346.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.250000</td>\n",
       "      <td>13953.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>2.080000</td>\n",
       "      <td>15152.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>10.000000</td>\n",
       "      <td>4.130000</td>\n",
       "      <td>16791.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>80995.000000</td>\n",
       "      <td>38970.000000</td>\n",
       "      <td>18287.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Quantity      UnitPrice     CustomerID\n",
       "count  541909.000000  541909.000000  406829.000000\n",
       "mean        9.552250       4.611114   15287.690570\n",
       "std       218.081158      96.759853    1713.600303\n",
       "min    -80995.000000  -11062.060000   12346.000000\n",
       "25%         1.000000       1.250000   13953.000000\n",
       "50%         3.000000       2.080000   15152.000000\n",
       "75%        10.000000       4.130000   16791.000000\n",
       "max     80995.000000   38970.000000   18287.000000"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从输出结果看，Quantity和UnitPrice存在负数，会对后续数值分析产生影响\n",
    "\n",
    "因此，先筛选出Quantity数值为负数的观察值，进一步评估其含义。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>141</th>\n",
       "      <td>C536379</td>\n",
       "      <td>D</td>\n",
       "      <td>Discount</td>\n",
       "      <td>-1</td>\n",
       "      <td>12/1/2010 9:41</td>\n",
       "      <td>27.50</td>\n",
       "      <td>14527.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>154</th>\n",
       "      <td>C536383</td>\n",
       "      <td>35004C</td>\n",
       "      <td>SET OF 3 COLOURED  FLYING DUCKS</td>\n",
       "      <td>-1</td>\n",
       "      <td>12/1/2010 9:49</td>\n",
       "      <td>4.65</td>\n",
       "      <td>15311.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>235</th>\n",
       "      <td>C536391</td>\n",
       "      <td>22556</td>\n",
       "      <td>PLASTERS IN TIN CIRCUS PARADE</td>\n",
       "      <td>-12</td>\n",
       "      <td>12/1/2010 10:24</td>\n",
       "      <td>1.65</td>\n",
       "      <td>17548.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>236</th>\n",
       "      <td>C536391</td>\n",
       "      <td>21984</td>\n",
       "      <td>PACK OF 12 PINK PAISLEY TISSUES</td>\n",
       "      <td>-24</td>\n",
       "      <td>12/1/2010 10:24</td>\n",
       "      <td>0.29</td>\n",
       "      <td>17548.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>237</th>\n",
       "      <td>C536391</td>\n",
       "      <td>21983</td>\n",
       "      <td>PACK OF 12 BLUE PAISLEY TISSUES</td>\n",
       "      <td>-24</td>\n",
       "      <td>12/1/2010 10:24</td>\n",
       "      <td>0.29</td>\n",
       "      <td>17548.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>540449</th>\n",
       "      <td>C581490</td>\n",
       "      <td>23144</td>\n",
       "      <td>ZINC T-LIGHT HOLDER STARS SMALL</td>\n",
       "      <td>-11</td>\n",
       "      <td>12/9/2011 9:57</td>\n",
       "      <td>0.83</td>\n",
       "      <td>14397.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541541</th>\n",
       "      <td>C581499</td>\n",
       "      <td>M</td>\n",
       "      <td>Manual</td>\n",
       "      <td>-1</td>\n",
       "      <td>12/9/2011 10:28</td>\n",
       "      <td>224.69</td>\n",
       "      <td>15498.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541715</th>\n",
       "      <td>C581568</td>\n",
       "      <td>21258</td>\n",
       "      <td>VICTORIAN SEWING BOX LARGE</td>\n",
       "      <td>-5</td>\n",
       "      <td>12/9/2011 11:57</td>\n",
       "      <td>10.95</td>\n",
       "      <td>15311.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541716</th>\n",
       "      <td>C581569</td>\n",
       "      <td>84978</td>\n",
       "      <td>HANGING HEART JAR T-LIGHT HOLDER</td>\n",
       "      <td>-1</td>\n",
       "      <td>12/9/2011 11:58</td>\n",
       "      <td>1.25</td>\n",
       "      <td>17315.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541717</th>\n",
       "      <td>C581569</td>\n",
       "      <td>20979</td>\n",
       "      <td>36 PENCILS TUBE RED RETROSPOT</td>\n",
       "      <td>-5</td>\n",
       "      <td>12/9/2011 11:58</td>\n",
       "      <td>1.25</td>\n",
       "      <td>17315.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10624 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       InvoiceNo StockCode                       Description  Quantity  \\\n",
       "141      C536379         D                          Discount        -1   \n",
       "154      C536383    35004C   SET OF 3 COLOURED  FLYING DUCKS        -1   \n",
       "235      C536391     22556    PLASTERS IN TIN CIRCUS PARADE        -12   \n",
       "236      C536391     21984  PACK OF 12 PINK PAISLEY TISSUES        -24   \n",
       "237      C536391     21983  PACK OF 12 BLUE PAISLEY TISSUES        -24   \n",
       "...          ...       ...                               ...       ...   \n",
       "540449   C581490     23144   ZINC T-LIGHT HOLDER STARS SMALL       -11   \n",
       "541541   C581499         M                            Manual        -1   \n",
       "541715   C581568     21258        VICTORIAN SEWING BOX LARGE        -5   \n",
       "541716   C581569     84978  HANGING HEART JAR T-LIGHT HOLDER        -1   \n",
       "541717   C581569     20979     36 PENCILS TUBE RED RETROSPOT        -5   \n",
       "\n",
       "            InvoiceDate  UnitPrice  CustomerID         Country  \n",
       "141      12/1/2010 9:41      27.50     14527.0  United Kingdom  \n",
       "154      12/1/2010 9:49       4.65     15311.0  United Kingdom  \n",
       "235     12/1/2010 10:24       1.65     17548.0  United Kingdom  \n",
       "236     12/1/2010 10:24       0.29     17548.0  United Kingdom  \n",
       "237     12/1/2010 10:24       0.29     17548.0  United Kingdom  \n",
       "...                 ...        ...         ...             ...  \n",
       "540449   12/9/2011 9:57       0.83     14397.0  United Kingdom  \n",
       "541541  12/9/2011 10:28     224.69     15498.0  United Kingdom  \n",
       "541715  12/9/2011 11:57      10.95     15311.0  United Kingdom  \n",
       "541716  12/9/2011 11:58       1.25     17315.0  United Kingdom  \n",
       "541717  12/9/2011 11:58       1.25     17315.0  United Kingdom  \n",
       "\n",
       "[10624 rows x 8 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data[original_data[\"Quantity\"] < 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2406</th>\n",
       "      <td>536589</td>\n",
       "      <td>21777</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-10</td>\n",
       "      <td>12/1/2010 16:50</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4347</th>\n",
       "      <td>536764</td>\n",
       "      <td>84952C</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-38</td>\n",
       "      <td>12/2/2010 14:42</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7188</th>\n",
       "      <td>536996</td>\n",
       "      <td>22712</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-20</td>\n",
       "      <td>12/3/2010 15:30</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7189</th>\n",
       "      <td>536997</td>\n",
       "      <td>22028</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-20</td>\n",
       "      <td>12/3/2010 15:30</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7190</th>\n",
       "      <td>536998</td>\n",
       "      <td>85067</td>\n",
       "      <td>NaN</td>\n",
       "      <td>-6</td>\n",
       "      <td>12/3/2010 15:30</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>535333</th>\n",
       "      <td>581210</td>\n",
       "      <td>23395</td>\n",
       "      <td>check</td>\n",
       "      <td>-26</td>\n",
       "      <td>12/7/2011 18:36</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>535335</th>\n",
       "      <td>581212</td>\n",
       "      <td>22578</td>\n",
       "      <td>lost</td>\n",
       "      <td>-1050</td>\n",
       "      <td>12/7/2011 18:38</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>535336</th>\n",
       "      <td>581213</td>\n",
       "      <td>22576</td>\n",
       "      <td>check</td>\n",
       "      <td>-30</td>\n",
       "      <td>12/7/2011 18:38</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>536908</th>\n",
       "      <td>581226</td>\n",
       "      <td>23090</td>\n",
       "      <td>missing</td>\n",
       "      <td>-338</td>\n",
       "      <td>12/8/2011 9:56</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>538919</th>\n",
       "      <td>581422</td>\n",
       "      <td>23169</td>\n",
       "      <td>smashed</td>\n",
       "      <td>-235</td>\n",
       "      <td>12/8/2011 15:24</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1336 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       InvoiceNo StockCode Description  Quantity      InvoiceDate  UnitPrice  \\\n",
       "2406      536589     21777         NaN       -10  12/1/2010 16:50        0.0   \n",
       "4347      536764    84952C         NaN       -38  12/2/2010 14:42        0.0   \n",
       "7188      536996     22712         NaN       -20  12/3/2010 15:30        0.0   \n",
       "7189      536997     22028         NaN       -20  12/3/2010 15:30        0.0   \n",
       "7190      536998     85067         NaN        -6  12/3/2010 15:30        0.0   \n",
       "...          ...       ...         ...       ...              ...        ...   \n",
       "535333    581210     23395       check       -26  12/7/2011 18:36        0.0   \n",
       "535335    581212     22578        lost     -1050  12/7/2011 18:38        0.0   \n",
       "535336    581213     22576       check       -30  12/7/2011 18:38        0.0   \n",
       "536908    581226     23090     missing      -338   12/8/2011 9:56        0.0   \n",
       "538919    581422     23169     smashed      -235  12/8/2011 15:24        0.0   \n",
       "\n",
       "        CustomerID         Country  \n",
       "2406           NaN  United Kingdom  \n",
       "4347           NaN  United Kingdom  \n",
       "7188           NaN  United Kingdom  \n",
       "7189           NaN  United Kingdom  \n",
       "7190           NaN  United Kingdom  \n",
       "...            ...             ...  \n",
       "535333         NaN  United Kingdom  \n",
       "535335         NaN  United Kingdom  \n",
       "535336         NaN  United Kingdom  \n",
       "536908         NaN  United Kingdom  \n",
       "538919         NaN  United Kingdom  \n",
       "\n",
       "[1336 rows x 8 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data[(original_data[\"Quantity\"] < 0) & (original_data[\"InvoiceNo\"].str[0] != \"C\")]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以上猜想被证实错误，因为还存在Quantity变量为负数且InvoiceNo不以C开头的观察值。\n",
    "\n",
    "但根据以上输出结果，这些筛选出的观察值的UnitPrice观察值均为0，因此增加UnitPrice的条件进行验证。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country]\n",
       "Index: []"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data[(original_data[\"Quantity\"] < 0) & (original_data[\"InvoiceNo\"].str[0] != \"C\")&(original_data[\"UnitPrice\"] != 0)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据输出结果，猜想得到验证，当Quantity变量为负数的时候，观察值满足以下条件之一：\n",
    "\n",
    " InvoiceNo以C开头，表示订单取消\n",
    " \n",
    " UnitPrice为0，说明单价为0英镑\n",
    "\n",
    "这些交易数据均不是有效成交数据，不贡献销售，不在后续分析范围内，因此我们将在数据清洗步骤，将Quantity为负数的观察值删除。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>InvoiceNo</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>UnitPrice</th>\n",
       "      <th>CustomerID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>299983</th>\n",
       "      <td>A563186</td>\n",
       "      <td>B</td>\n",
       "      <td>Adjust bad debt</td>\n",
       "      <td>1</td>\n",
       "      <td>8/12/2011 14:51</td>\n",
       "      <td>-11062.06</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>299984</th>\n",
       "      <td>A563187</td>\n",
       "      <td>B</td>\n",
       "      <td>Adjust bad debt</td>\n",
       "      <td>1</td>\n",
       "      <td>8/12/2011 14:52</td>\n",
       "      <td>-11062.06</td>\n",
       "      <td>NaN</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       InvoiceNo StockCode      Description  Quantity      InvoiceDate  \\\n",
       "299983   A563186         B  Adjust bad debt         1  8/12/2011 14:51   \n",
       "299984   A563187         B  Adjust bad debt         1  8/12/2011 14:52   \n",
       "\n",
       "        UnitPrice  CustomerID         Country  \n",
       "299983  -11062.06         NaN  United Kingdom  \n",
       "299984  -11062.06         NaN  United Kingdom  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "original_data[(original_data[\"UnitPrice\"] < 0)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从输出结果来看，UnitPrice为负数的观察值都是坏账调账，不属于实际商品交易数据，因此也在数据清洗步骤中将其删除"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
